Assignment 2:

Table of contents



Importing all libraries used in this assignment

Data Preparation:

Importing the data:

City Data Preparation:

Changing the column names to english:

Visualize NaN values:

Filling missing data using different approaches:

Missing population number data:

We will fill the missing population data using the shape of each settlement. Each shape describes the approximate type and size of the settlement. We replaced each missing value with the median population of that shape.
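The fill described above can be sketched as follows; the column names (`shape`, `population`) and the sample values are assumptions for illustration:

```python
import pandas as pd
import numpy as np

# Hypothetical city data: 'shape' encodes the approximate settlement type/size.
df_city = pd.DataFrame({
    "shape": ["city", "city", "village", "village", "village"],
    "population": [50000, np.nan, 1200, 800, np.nan],
})

# Replace each missing population with the median population of
# settlements that share the same shape.
df_city["population"] = df_city.groupby("shape")["population"].transform(
    lambda s: s.fillna(s.median())
)
```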

Missing religion data:

We will fill the missing religion data using the same logic.

Nationality data filling:

We will fill the missing nationality values by using the shape to determine the settlement type (Jewish, Arab, or other). For Jewish settlements we fill 90% of the population as "Jews & others" and 10% as Arab; for Arab settlements we fill 90% of the population as Arab and 10% as "Jews & others".
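A minimal sketch of the 90/10 fill, assuming hypothetical column names (`settlement_type`, `population`, `jews_and_others`, `arabs`):

```python
import pandas as pd

def fill_nationality(row):
    # Split the total population 90/10 according to the settlement type
    # inferred from the shape (column names are assumptions).
    total = row["population"]
    if row["settlement_type"] == "jewish":
        row["jews_and_others"] = 0.9 * total
        row["arabs"] = 0.1 * total
    elif row["settlement_type"] == "arab":
        row["arabs"] = 0.9 * total
        row["jews_and_others"] = 0.1 * total
    return row

df = pd.DataFrame({
    "settlement_type": ["jewish", "arab"],
    "population": [1000.0, 500.0],
    "jews_and_others": [float("nan"), float("nan")],
    "arabs": [float("nan"), float("nan")],
})
df = df.apply(fill_nationality, axis=1)
```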

Filling missing police data using subdistrict:

All data from mapping data:

Crime Data Preparation:

Changing the column names to english:

Dealing with NaN values in df_crime:

We chose to drop the 'year message' column because it is uninformative.

Renaming values from Hebrew to English:

Data Visualization:

crimes_TOTAL_cities:

Contains total crimes per settlement:

Correlation matrix:

From the correlation matrix we can infer that the larger the population, the larger the number of crimes committed. We can also see that this correlation is very strong for the "Jews & others" population; in contrast, the connection between total crime and the Arab population is weaker. We can also see that district and subdistrict are highly correlated with the planning committee. We can infer for later work that not all of the features are needed.
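The matrix can be computed with pandas `DataFrame.corr`; a sketch on synthetic data (the real columns come from the merged city/crime tables):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
pop = rng.integers(1_000, 100_000, size=50).astype(float)
# Synthetic total crimes, roughly proportional to population.
crimes = 0.02 * pop + rng.normal(0, 100, size=50)
df = pd.DataFrame({"population": pop, "total_crimes": crimes})

corr = df.corr()  # Pearson correlation matrix
```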

Scatter matrix of features

From the scatter matrix of features we can again see the correlation between population size and total crimes committed. We can now revisit our previous assumption that the correlation is greater in "Jews & others" communities than between the Arab community and total crimes committed. From this matrix we can see that the differences arise merely because Arab settlements are in general smaller than "Jews & others" communities; the differences stem not from the type of population, but merely from its size.

Bar plot - 'Settlements with high offences rates in Israel (top-15)':

Here we see the top 15 settlements by crime percentage over 2014-2019:

Bar plot - 'Crime percentage for districts':

Here we see the crime percentage per district. We added error bars that show the variance within each district.

3. Clustering

Demographic analysis:

The research question we'll be trying to answer in this section is the correlation between settlement size and crime percentage. In other words, we expect mixed-nationality settlements to rank higher in crime percentage. To do that we'll fit the crime data using two clustering methods; once we have K clusters we'll view their districts and infer whether we indeed found a correlation.

Clustering algorithms:

We would like to see the connection between settlement size and its crime percentage. First we will prepare the data and visualize it, to see which clustering algorithm we think is preferable.

From the graph we cannot conclude a clear connection; therefore we chose the KMeans algorithm to first see what the data looks like and then, if needed, choose the second algorithm more wisely.

3.1 KMeans algorithm:

Elbow function:

We use the Elbow Method to find the optimal k for the algorithm, i.e. the number of clusters we will get. We plot the sum of squared errors (SSE) for each k; connecting the points yields an elbow shape, and the k closest to the elbow is the optimal k.
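The Elbow Method can be sketched as follows, on synthetic data with three well-separated blobs so the elbow is visible at k = 3:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Synthetic 2-D data: three tight blobs centred at (0,0), (5,5), (10,10).
X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in (0, 5, 10)])

sse = {}
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse[k] = km.inertia_  # sum of squared distances to nearest centroid

# Plotting sse vs. k gives the elbow curve; here the SSE drops sharply
# until k = 3 and flattens afterwards, so k = 3 is the elbow.
```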

Conclusions from the KMeans algorithm about the connection between settlement size and crime percentage:

We can see that in most settlements, most of the crime rates are between 0%-10%. The difference is in the distribution: in larger settlements the distribution of crime percentage is wider, meaning the larger the settlement, the more varied its crime percentage, i.e. more distinct values. Whereas in smaller settlements the distribution is narrower, and most small settlements fall between 0%-10% crime.

GMM algorithm:

We chose GMM clustering based on the KMeans conclusions: we saw that the differences were in the distribution of the data.
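Since the groups differ mainly in spread rather than location, a Gaussian Mixture Model fits naturally; a sketch on synthetic data with the same mean but different variances:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two groups with the same mean "crime rate" but different spread,
# mimicking the distribution difference seen in the KMeans results.
narrow = rng.normal(5, 1, size=(200, 1))   # small settlements: tight spread
wide = rng.normal(5, 4, size=(200, 1))     # large settlements: wide spread
X = np.vstack([narrow, wide])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gmm.predict(X)  # each point assigned to the more likely component
```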

Conclusions from GMM :

The GMM algorithm supports our previous conclusion: most of the crime rates are between 0%-10% regardless of settlement size (population-wise). But in larger settlements we can see a more varied distribution at higher crime percentages.

Random Forest:

Choosing features to work with:

1. From the correlation matrix we saw a high correlation between 'district', 'subdistrict' and 'Planning Committee'. Therefore we will only keep 'Planning Committee'.
2. The Random Forest algorithm only uses numeric features, therefore all data will be numeric.
3. Although crime type is not numeric, we need it for future analysis. Therefore we will convert it into numeric values; when it is time for the final result analysis, we will map the values back.
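The crime-type encoding (step 3) can be sketched with `pd.factorize`, which maps each category to an integer code and keeps the original labels for mapping back later; the column name and values are illustrative:

```python
import pandas as pd

# Hypothetical 'crime_type' column with categorical values.
df = pd.DataFrame({"crime_type": ["theft", "assault", "theft", "fraud"]})

# factorize returns integer codes plus the unique labels, in order of appearance.
codes, labels = pd.factorize(df["crime_type"])
df["crime_type_code"] = codes

# After the final analysis, recover the original names from the codes.
df["crime_type_back"] = labels[codes]
```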

From data_plot table:

We can see that the predictions of the second most common crime type are correct for all settlements. However, the actual numbers predicted for each crime are not 100% accurate.

In the next plot, we can see the differences between the size of the actual crimes in 2019, and the predicted size.

Table visualization:

5. AdaBoost:

Our goal is to predict the overall crimes in 2019 in the following settlements, using the AdaBoost algorithm: Keryat Ata, Rosh Pina, Eylat, Sachnin and Beir Seva.

Data preparation for AdaBoost:

train data:

X - all features (except names and features with NaN values) of the untargeted settlements

y - the year 2019 for the untargeted settlements

test data:

X - all features (except names and features with NaN values) of the targeted settlements

y - the year 2019 for the targeted settlements

Optimizing the algorithm:

We are using grid search to find the best parameters to send to the model. We will print the features the model found most significant (using AdaBoost regression), and those will be the features and parameters we use.
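The tuning step can be sketched with `GridSearchCV` over an `AdaBoostRegressor`; the parameter grid and the synthetic data (where feature 0 dominates the target) are assumptions:

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 4))
y = 3 * X[:, 0] + rng.normal(0, 0.1, size=120)  # feature 0 drives the target

# Grid search over a small (illustrative) parameter grid with 3-fold CV.
param_grid = {"n_estimators": [25, 50], "learning_rate": [0.5, 1.0]}
search = GridSearchCV(AdaBoostRegressor(random_state=0), param_grid, cv=3)
search.fit(X, y)

best = search.best_estimator_
importances = best.feature_importances_  # most significant features
```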

6. Police force suggestions:

In the next task we decided to predict the police strategy in every city using RandomForestClassifier. First we calculated all necessary resources in each city for every year. Next, for every pair of consecutive years in each city, we calculated the percent growth or decrease in these resources relative to the first year of the pair. We decided that the best threshold is 20%, which in our opinion is optimal: a lower threshold would produce many suggestions to change the amount of resources (which does not happen in the police every year), while a higher threshold would miss important changes.

Our strategy labels were calculated as follows: if the amount of resources grew by more than 20% in the next year, it is labeled as 1; a decrease of more than 20% is labeled as -1; and other changes are labeled as 0.
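The labeling rule above can be expressed as a small helper; the function name is our own:

```python
def label_change(prev, curr, threshold=0.2):
    """Label the resource change between two consecutive years:
    1 for growth above 20%, -1 for a decrease above 20%, 0 otherwise."""
    rate = (curr - prev) / prev
    if rate > threshold:
        return 1
    if rate < -threshold:
        return -1
    return 0
```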

7. Clustering renovation:

target